Skip to content

Conversation

jeromekelleher
Copy link
Contributor

Supporting uncompressed VCF is a bit of a faff, but worth the trouble. Here's a WIP for that.

cc @hyanwong

@hyanwong
Copy link

Oh, that's neat. Thanks @jeromekelleher. From my reading here this is supporting non-indexed VCFs, rather than simply non-compressed VCFs, right?

@jeromekelleher
Copy link
Contributor Author

Compressed or uncompressed is fine.

@jeromekelleher
Copy link
Contributor Author

You're right though, it's the indexing that matters

@jeromekelleher jeromekelleher changed the title Uncompressed VCF is possible Support indexed VCFs Mar 28, 2025
@jeromekelleher jeromekelleher changed the title Support indexed VCFs Support unindexed VCFs Mar 28, 2025
@coveralls
Copy link
Collaborator

coveralls commented Mar 28, 2025

Coverage Status

coverage: 98.867% (+0.05%) from 98.816%
when pulling b3af0dc on jeromekelleher:uncompressed-vcf
into 1f9274b on sgkit-dev:main.

@jeromekelleher
Copy link
Contributor Author

OK, I think this is ready to go. We can now support VCFs that don't have an index and (therefore) can also support uncompressed VCF text. The rules are

  1. Each unindexed file corresponds to 1 partition (so limits parallelism during explode)
  2. We cannot have multiple contigs within an uncompressed file

Other than that, everything works as before. We can mixed indexed and unindexed files no problem.

@hyanwong
Copy link

Woohoo. This is great. @AprilWei001 will be pleased.

Copy link
Contributor

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Just a couple of minor comments.


def contig_record_counts(self):
if self.index is None:
return {self.sequence_names[0]: np.inf}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this infinite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent question! Fixed to use the correct constant.

TABIX = ".tbi"


class IndexedVcf(contextlib.AbstractContextManager):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is now a bit misleading since the VCF is not necessarily indexed. The class is used for partitioning, iterating over variants, and getting metadata about number of variants. I can't think of a better name though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. I've renamed to "VcfFile". It's good enough.

@jeromekelleher jeromekelleher merged commit 55a7623 into sgkit-dev:main Mar 31, 2025
16 checks passed
@jeromekelleher jeromekelleher deleted the uncompressed-vcf branch March 31, 2025 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants